.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data=pd.read_csv("Data/train.csv")
data.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 1452 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
pd.set_option('display.max_columns', None)
data.describe()
| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageYrBlt | GarageCars | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.000000 | 1460.000000 | 1201.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1452.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1379.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
| mean | 730.500000 | 56.897260 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 46.549315 | 567.240411 | 1057.429452 | 1162.626712 | 346.992466 | 5.844521 | 1515.463699 | 0.425342 | 0.057534 | 1.565068 | 0.382877 | 2.866438 | 1.046575 | 6.517808 | 0.613014 | 1978.506164 | 1.767123 | 472.980137 | 94.244521 | 46.660274 | 21.954110 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.195890 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 161.319273 | 441.866955 | 438.705324 | 386.587738 | 436.528436 | 48.623081 | 525.480383 | 0.518911 | 0.238753 | 0.550916 | 0.502885 | 0.815778 | 0.220338 | 1.625393 | 0.644666 | 24.689725 | 0.747315 | 213.804841 | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.000000 | 20.000000 | 21.000000 | 1300.000000 | 1.000000 | 1.000000 | 1872.000000 | 1950.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 334.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 1900.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2006.000000 | 34900.000000 |
| 25% | 365.750000 | 20.000000 | 59.000000 | 7553.500000 | 5.000000 | 5.000000 | 1954.000000 | 1967.000000 | 0.000000 | 0.000000 | 0.000000 | 223.000000 | 795.750000 | 882.000000 | 0.000000 | 0.000000 | 1129.500000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 5.000000 | 0.000000 | 1961.000000 | 1.000000 | 334.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 | 2007.000000 | 129975.000000 |
| 50% | 730.500000 | 50.000000 | 69.000000 | 9478.500000 | 6.000000 | 5.000000 | 1973.000000 | 1994.000000 | 0.000000 | 383.500000 | 0.000000 | 477.500000 | 991.500000 | 1087.000000 | 0.000000 | 0.000000 | 1464.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 6.000000 | 1.000000 | 1980.000000 | 2.000000 | 480.000000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.000000 | 2008.000000 | 163000.000000 |
| 75% | 1095.250000 | 70.000000 | 80.000000 | 11601.500000 | 7.000000 | 6.000000 | 2000.000000 | 2004.000000 | 166.000000 | 712.250000 | 0.000000 | 808.000000 | 1298.250000 | 1391.250000 | 728.000000 | 0.000000 | 1776.750000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 7.000000 | 1.000000 | 2002.000000 | 2.000000 | 576.000000 | 168.000000 | 68.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 2009.000000 | 214000.000000 |
| max | 1460.000000 | 190.000000 | 313.000000 | 215245.000000 | 10.000000 | 9.000000 | 2010.000000 | 2010.000000 | 1600.000000 | 5644.000000 | 1474.000000 | 2336.000000 | 6110.000000 | 4692.000000 | 2065.000000 | 572.000000 | 5642.000000 | 3.000000 | 2.000000 | 3.000000 | 2.000000 | 8.000000 | 3.000000 | 14.000000 | 3.000000 | 2010.000000 | 4.000000 | 1418.000000 | 857.000000 | 547.000000 | 552.000000 | 508.000000 | 480.000000 | 738.000000 | 15500.000000 | 12.000000 | 2010.000000 | 755000.000000 |
pd.set_option('display.max_rows', None)
data.isnull().sum()
Id 0 MSSubClass 0 MSZoning 0 LotFrontage 259 LotArea 0 Street 0 Alley 1369 LotShape 0 LandContour 0 Utilities 0 LotConfig 0 LandSlope 0 Neighborhood 0 Condition1 0 Condition2 0 BldgType 0 HouseStyle 0 OverallQual 0 OverallCond 0 YearBuilt 0 YearRemodAdd 0 RoofStyle 0 RoofMatl 0 Exterior1st 0 Exterior2nd 0 MasVnrType 8 MasVnrArea 8 ExterQual 0 ExterCond 0 Foundation 0 BsmtQual 37 BsmtCond 37 BsmtExposure 38 BsmtFinType1 37 BsmtFinSF1 0 BsmtFinType2 38 BsmtFinSF2 0 BsmtUnfSF 0 TotalBsmtSF 0 Heating 0 HeatingQC 0 CentralAir 0 Electrical 1 1stFlrSF 0 2ndFlrSF 0 LowQualFinSF 0 GrLivArea 0 BsmtFullBath 0 BsmtHalfBath 0 FullBath 0 HalfBath 0 BedroomAbvGr 0 KitchenAbvGr 0 KitchenQual 0 TotRmsAbvGrd 0 Functional 0 Fireplaces 0 FireplaceQu 690 GarageType 81 GarageYrBlt 81 GarageFinish 81 GarageCars 0 GarageArea 0 GarageQual 81 GarageCond 81 PavedDrive 0 WoodDeckSF 0 OpenPorchSF 0 EnclosedPorch 0 3SsnPorch 0 ScreenPorch 0 PoolArea 0 PoolQC 1453 Fence 1179 MiscFeature 1406 MiscVal 0 MoSold 0 YrSold 0 SaleType 0 SaleCondition 0 SalePrice 0 dtype: int64
data["Alley"]=data["Alley"].fillna("None")
data["MasVnrArea"]=data["MasVnrArea"].fillna("None")
data["BsmtExposure"]=data["BsmtExposure"].fillna("No Garage")
data["BsmtFinType1"]=data["BsmtFinType1"].fillna("No Garage")
data["BsmtFinType2"]=data["BsmtFinType2"].fillna("No Garage")
data["FireplaceQu"]=data["FireplaceQu"].fillna("NoFirePlace")
data["GarageType"]=data["GarageType"].fillna("None")
data["GarageFinish"]=data["GarageFinish"].fillna("None")
data["GarageQual"]=data["GarageQual"].fillna("None")
data["GarageCond"]=data["GarageCond"].fillna("None")
data["GarageYrBlt"]=data["GarageYrBlt"].fillna("None")
data["PoolQC"]=data["PoolQC"].fillna("NoPool")
data["Fence"]=data["Fence"].fillna("NoFence")
data["MiscFeature"]=data["MiscFeature"].fillna("None")
data=data.dropna()
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1163 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1163 non-null int64 1 MSSubClass 1163 non-null int64 2 MSZoning 1163 non-null object 3 LotFrontage 1163 non-null float64 4 LotArea 1163 non-null int64 5 Street 1163 non-null object 6 Alley 1163 non-null object 7 LotShape 1163 non-null object 8 LandContour 1163 non-null object 9 Utilities 1163 non-null object 10 LotConfig 1163 non-null object 11 LandSlope 1163 non-null object 12 Neighborhood 1163 non-null object 13 Condition1 1163 non-null object 14 Condition2 1163 non-null object 15 BldgType 1163 non-null object 16 HouseStyle 1163 non-null object 17 OverallQual 1163 non-null int64 18 OverallCond 1163 non-null int64 19 YearBuilt 1163 non-null int64 20 YearRemodAdd 1163 non-null int64 21 RoofStyle 1163 non-null object 22 RoofMatl 1163 non-null object 23 Exterior1st 1163 non-null object 24 Exterior2nd 1163 non-null object 25 MasVnrType 1163 non-null object 26 MasVnrArea 1163 non-null object 27 ExterQual 1163 non-null object 28 ExterCond 1163 non-null object 29 Foundation 1163 non-null object 30 BsmtQual 1163 non-null object 31 BsmtCond 1163 non-null object 32 BsmtExposure 1163 non-null object 33 BsmtFinType1 1163 non-null object 34 BsmtFinSF1 1163 non-null int64 35 BsmtFinType2 1163 non-null object 36 BsmtFinSF2 1163 non-null int64 37 BsmtUnfSF 1163 non-null int64 38 TotalBsmtSF 1163 non-null int64 39 Heating 1163 non-null object 40 HeatingQC 1163 non-null object 41 CentralAir 1163 non-null object 42 Electrical 1163 non-null object 43 1stFlrSF 1163 non-null int64 44 2ndFlrSF 1163 non-null int64 45 LowQualFinSF 1163 non-null int64 46 GrLivArea 1163 non-null int64 47 BsmtFullBath 1163 non-null int64 48 BsmtHalfBath 1163 non-null int64 49 FullBath 1163 non-null int64 50 HalfBath 1163 non-null int64 51 BedroomAbvGr 1163 non-null int64 52 KitchenAbvGr 1163 non-null int64 53 KitchenQual 1163 non-null object 54 TotRmsAbvGrd 1163 non-null int64 55 Functional 1163 non-null object 56 Fireplaces 1163 non-null int64 57 FireplaceQu 1163 non-null object 58 GarageType 1163 non-null object 59 GarageYrBlt 1163 non-null object 60 GarageFinish 1163 non-null object 61 GarageCars 1163 non-null int64 62 GarageArea 1163 non-null int64 63 GarageQual 1163 non-null object 64 GarageCond 1163 non-null object 65 PavedDrive 1163 non-null object 66 WoodDeckSF 1163 non-null int64 67 OpenPorchSF 1163 non-null int64 68 EnclosedPorch 1163 non-null int64 69 3SsnPorch 1163 non-null int64 70 ScreenPorch 1163 non-null int64 71 PoolArea 1163 non-null int64 72 PoolQC 1163 non-null object 73 Fence 1163 non-null object 74 MiscFeature 1163 non-null object 75 MiscVal 1163 non-null int64 76 MoSold 1163 non-null int64 77 YrSold 1163 non-null int64 78 SaleType 1163 non-null object 79 SaleCondition 1163 non-null object 80 SalePrice 1163 non-null int64 dtypes: float64(1), int64(35), object(45) memory usage: 745.0+ KB
data["FireplaceQu"].value_counts()
NoFirePlace 571 Gd 319 TA 215 Fa 24 Ex 20 Po 14 Name: FireplaceQu, dtype: int64
data["RmsAbvGrd"]=data['FullBath'] + data['HalfBath'] + data['TotRmsAbvGrd']
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']
data['Age'] = data['YrSold'] - data['YearBuilt']
data['GarageCond'] = data['GarageCond'].replace({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'None': 0})
data["SalePrice"].describe()
count 1163.000000 mean 182470.369733 std 83251.315928 min 34900.000000 25% 129000.000000 50% 160000.000000 75% 215000.000000 max 755000.000000 Name: SalePrice, dtype: float64
sns.set(rc={'figure.figsize':(8,3)})
sns.histplot(data['SalePrice'], kde=True,color="orange")
<AxesSubplot:xlabel='SalePrice', ylabel='Count'>
data.corr()['SalePrice']
Id -0.042740 MSSubClass -0.096158 LotFrontage 0.350459 LotArea 0.313132 OverallQual 0.798380 OverallCond -0.129849 YearBuilt 0.536357 YearRemodAdd 0.515019 BsmtFinSF1 0.387365 BsmtFinSF2 -0.023806 BsmtUnfSF 0.191331 TotalBsmtSF 0.624018 1stFlrSF 0.627775 2ndFlrSF 0.307785 LowQualFinSF -0.025423 GrLivArea 0.705342 BsmtFullBath 0.220941 BsmtHalfBath -0.033800 FullBath 0.582506 HalfBath 0.270098 BedroomAbvGr 0.168081 KitchenAbvGr -0.109068 TotRmsAbvGrd 0.551535 Fireplaces 0.476748 GarageCars 0.649845 GarageArea 0.630398 GarageCond 0.253901 WoodDeckSF 0.336905 OpenPorchSF 0.314584 EnclosedPorch -0.169488 3SsnPorch 0.038737 ScreenPorch 0.117900 PoolArea 0.093132 MiscVal -0.031603 MoSold 0.049659 YrSold -0.011652 SalePrice 1.000000 RmsAbvGrd 0.616317 TotalSF 0.776461 Age -0.536046 Name: SalePrice, dtype: float64
sns.set(rc={'figure.figsize':(30,20)})
corr=data.corr()
sns.heatmap(corr, cmap='YlOrBr', vmin=-1, vmax=1,annot=True, fmt='.2f')
<AxesSubplot:>
corr = data.corr()
strong_corr = corr['SalePrice'][abs(corr['SalePrice']) > 0.5]
sns.set(rc={'figure.figsize':(12,6)})
sns.heatmap(data[strong_corr.index].corr(), annot=True, cmap='YlOrBr')
<AxesSubplot:>
data["GrLivArea"].corr(data["SalePrice"])
0.705342254866112
sns.set(rc={'figure.figsize':(6,6)})
sns.scatterplot(data=data,x=data['GrLivArea'],y=data["SalePrice"], color="orange")
<AxesSubplot:xlabel='GrLivArea', ylabel='SalePrice'>
data.groupby('OverallQual')["SalePrice"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| OverallQual | ||||||||
| 2 | 3.0 | 51770.333333 | 14254.200796 | 35311.0 | 47655.50 | 60000.0 | 60000.0 | 60000.0 |
| 3 | 14.0 | 84391.071429 | 25435.982249 | 37900.0 | 69375.00 | 83000.0 | 102500.0 | 126175.0 |
| 4 | 86.0 | 106518.767442 | 25206.920495 | 34900.0 | 88367.75 | 108729.5 | 124275.0 | 176000.0 |
| 5 | 319.0 | 131741.313480 | 26604.858223 | 55993.0 | 117750.00 | 132000.0 | 145000.0 | 225000.0 |
| 6 | 281.0 | 156759.576512 | 34730.874739 | 76000.0 | 134000.00 | 155900.0 | 177000.0 | 259500.0 |
| 7 | 261.0 | 207037.544061 | 45552.404146 | 82500.0 | 179000.00 | 200000.0 | 230000.0 | 383970.0 |
| 8 | 141.0 | 277665.035461 | 64219.617119 | 122000.0 | 236500.00 | 270000.0 | 312500.0 | 538000.0 |
| 9 | 42.0 | 368810.952381 | 81811.134066 | 239000.0 | 319925.00 | 347500.0 | 392074.0 | 611657.0 |
| 10 | 16.0 | 435193.187500 | 169778.833404 | 160000.0 | 334375.00 | 414430.5 | 495000.0 | 755000.0 |
sns.boxplot(x=data['OverallQual'],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='OverallQual', ylabel='SalePrice'>
data.groupby("MSZoning")["SalePrice"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| MSZoning | ||||||||
| C (all) | 10.0 | 74528.000000 | 33791.092031 | 34900.0 | 43998.25 | 74700.0 | 98332.0 | 133900.0 |
| FV | 54.0 | 218930.814815 | 55165.134293 | 144152.0 | 174500.00 | 208600.0 | 251645.0 | 370878.0 |
| RH | 12.0 | 129494.500000 | 39821.492590 | 76000.0 | 98200.00 | 131500.0 | 148608.5 | 200000.0 |
| RL | 892.0 | 194470.776906 | 84812.231061 | 52000.0 | 138000.00 | 175000.0 | 228125.0 | 755000.0 |
| RM | 195.0 | 126275.020513 | 50219.368150 | 37900.0 | 100000.00 | 120000.0 | 138250.0 | 475000.0 |
sns.boxplot(data=data,x=data["MSZoning"],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='MSZoning', ylabel='SalePrice'>
print(data.groupby("Alley")["SalePrice"].median())
data.groupby("Alley")["SalePrice"].describe()
Alley Grvl 122000.0 None 164000.0 Pave 172200.0 Name: SalePrice, dtype: float64
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Alley | ||||||||
| Grvl | 46.0 | 122591.391304 | 35398.503360 | 52500.0 | 105375.0 | 122000.0 | 136750.0 | 256000.0 |
| None | 1079.0 | 185584.070436 | 84827.911788 | 34900.0 | 130000.0 | 164000.0 | 222000.0 | 755000.0 |
| Pave | 38.0 | 166542.736842 | 38590.669108 | 40000.0 | 151250.0 | 172200.0 | 184750.0 | 265979.0 |
sns.set(rc={'figure.figsize':(4,6)})
sns.boxplot(data=data,x=data["Alley"],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='Alley', ylabel='SalePrice'>
data.groupby('RmsAbvGrd')['SalePrice'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| RmsAbvGrd | ||||||||
| 4 | 13.0 | 97093.153846 | 35262.858071 | 35311.0 | 80000.0 | 86000.0 | 131500.00 | 155900.0 |
| 5 | 53.0 | 105874.018868 | 25237.549014 | 34900.0 | 89000.0 | 110000.0 | 125000.00 | 141000.0 |
| 6 | 153.0 | 124122.000000 | 35364.469850 | 52000.0 | 109900.0 | 123000.0 | 134800.00 | 385000.0 |
| 7 | 209.0 | 144889.875598 | 46345.500199 | 37900.0 | 123500.0 | 140000.0 | 160000.00 | 392500.0 |
| 8 | 203.0 | 176919.064039 | 53750.820659 | 75000.0 | 141500.0 | 170000.0 | 203500.00 | 374000.0 |
| 9 | 190.0 | 198288.742105 | 66909.020020 | 81000.0 | 153925.0 | 181000.0 | 234542.50 | 440000.0 |
| 10 | 136.0 | 208172.411765 | 69842.240489 | 87000.0 | 159000.0 | 189000.0 | 232750.00 | 423000.0 |
| 11 | 100.0 | 226486.750000 | 69706.048326 | 110000.0 | 179530.0 | 215000.0 | 264076.25 | 501837.0 |
| 12 | 50.0 | 256018.180000 | 96229.603393 | 90000.0 | 175500.0 | 255750.0 | 310250.00 | 555000.0 |
| 13 | 31.0 | 288718.838710 | 91848.880078 | 106000.0 | 249000.0 | 309000.0 | 346321.50 | 451950.0 |
| 14 | 13.0 | 441257.307692 | 189804.109524 | 154300.0 | 320000.0 | 424870.0 | 611657.00 | 755000.0 |
| 15 | 7.0 | 322690.142857 | 173431.890561 | 160000.0 | 190875.0 | 200500.0 | 480000.00 | 556581.0 |
| 16 | 5.0 | 354186.600000 | 190075.080160 | 200000.0 | 200000.0 | 250000.0 | 538000.00 | 582933.0 |
data['RmsAbvGrd'].corr(data['SalePrice'])
0.6163169162341261
sns.set(rc={'figure.figsize':(8,6)})
sns.boxplot(x=data['RmsAbvGrd'],y=data['SalePrice'],palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='RmsAbvGrd', ylabel='SalePrice'>
data.groupby('Foundation')['SalePrice'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Foundation | ||||||||
| BrkTil | 137.0 | 132987.934307 | 55424.527582 | 37900.0 | 102000.0 | 127000.0 | 153500.00 | 475000.0 |
| CBlock | 486.0 | 144841.370370 | 45765.603062 | 34900.0 | 119500.0 | 139000.0 | 160000.00 | 402861.0 |
| PConc | 532.0 | 229883.518797 | 90676.888880 | 78000.0 | 174000.0 | 209750.0 | 270000.00 | 755000.0 |
| Stone | 6.0 | 165959.166667 | 78557.704413 | 102776.0 | 109625.0 | 126500.0 | 233734.25 | 266500.0 |
| Wood | 2.0 | 153500.000000 | 14849.242405 | 143000.0 | 148250.0 | 153500.0 | 158750.00 | 164000.0 |
sns.barplot(x=data["Foundation"],y=data["SalePrice"],data=data,palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='Foundation', ylabel='SalePrice'>
print(f"Correlation: {data['TotalBsmtSF'].corr(data['SalePrice'])}")
sns.scatterplot(x=data['TotalBsmtSF'], y=data['SalePrice'], color="orange")
Correlation: 0.6240176156843997
<AxesSubplot:xlabel='TotalBsmtSF', ylabel='SalePrice'>
data.groupby("Neighborhood")["SalePrice"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Neighborhood | ||||||||
| Blmngtn | 14.0 | 191790.357143 | 30947.405296 | 159895.0 | 172875.0 | 186000.0 | 193775.75 | 264561.0 |
| Blueste | 2.0 | 137500.000000 | 19091.883092 | 124000.0 | 130750.0 | 137500.0 | 144250.00 | 151000.0 |
| BrDale | 16.0 | 104493.750000 | 14330.176493 | 83000.0 | 91000.0 | 106000.0 | 118000.00 | 125000.0 |
| BrkSide | 49.0 | 127399.489796 | 38475.963983 | 52000.0 | 105000.0 | 127500.0 | 141500.00 | 223500.0 |
| ClearCr | 12.0 | 212166.666667 | 56660.204623 | 130000.0 | 182750.0 | 205750.0 | 240375.00 | 328000.0 |
| CollgCr | 125.0 | 197625.328000 | 50108.821490 | 110000.0 | 164990.0 | 196500.0 | 221500.00 | 424870.0 |
| Crawfor | 41.0 | 203298.463415 | 69563.313243 | 90350.0 | 154900.0 | 188700.0 | 239000.00 | 392500.0 |
| Edwards | 83.0 | 130343.373494 | 41173.733466 | 58500.0 | 107250.0 | 126000.0 | 146700.00 | 320000.0 |
| Gilbert | 49.0 | 194059.306122 | 41871.184636 | 141000.0 | 175000.0 | 180000.0 | 192000.00 | 377500.0 |
| IDOTRR | 33.0 | 99093.333333 | 31962.801367 | 34900.0 | 81000.0 | 103000.0 | 120500.00 | 152000.0 |
| MeadowV | 15.0 | 98486.666667 | 24395.721717 | 75000.0 | 82750.0 | 88000.0 | 112000.00 | 151400.0 |
| Mitchel | 34.0 | 153886.352941 | 35947.821599 | 84500.0 | 130937.5 | 145000.0 | 170750.00 | 240000.0 |
| NAmes | 178.0 | 147864.511236 | 33021.925742 | 97000.0 | 129000.0 | 140000.0 | 159375.00 | 345000.0 |
| NPkVill | 7.0 | 144642.857143 | 8754.590633 | 127500.0 | 143000.0 | 147000.0 | 148500.00 | 155000.0 |
| NWAmes | 45.0 | 184211.111111 | 35510.936038 | 82500.0 | 165000.0 | 179900.0 | 202500.00 | 278000.0 |
| NoRidge | 33.0 | 337412.363636 | 130116.489801 | 190000.0 | 265000.0 | 290000.0 | 341000.00 | 755000.0 |
| NridgHt | 75.0 | 316214.506667 | 94665.046255 | 154000.0 | 253646.5 | 315000.0 | 373201.00 | 611657.0 |
| OldTown | 107.0 | 128643.074766 | 53870.859031 | 37900.0 | 105450.0 | 119900.0 | 140000.00 | 475000.0 |
| SWISU | 23.0 | 141947.130435 | 33859.866674 | 60000.0 | 127500.0 | 137450.0 | 164217.00 | 200000.0 |
| Sawyer | 46.0 | 136666.130435 | 20389.995805 | 62383.0 | 129000.0 | 135000.0 | 146975.00 | 180000.0 |
| SawyerW | 45.0 | 189342.600000 | 56601.924579 | 76000.0 | 153337.0 | 179900.0 | 228000.00 | 320000.0 |
| Somerst | 75.0 | 230586.880000 | 57637.330120 | 144152.0 | 184450.0 | 229456.0 | 258750.00 | 423000.0 |
| StoneBr | 20.0 | 335523.750000 | 111762.680173 | 180000.0 | 248000.0 | 322000.0 | 392608.00 | 556581.0 |
| Timber | 29.0 | 252498.379310 | 66764.973332 | 170000.0 | 187500.0 | 248900.0 | 305000.00 | 378500.0 |
| Veenker | 7.0 | 258714.285714 | 74470.272113 | 181500.0 | 202500.0 | 245500.0 | 297000.00 | 385000.0 |
sns.set(rc={'figure.figsize':(30,10)})
sns.barplot(x=data["Neighborhood"],y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='Neighborhood', ylabel='SalePrice'>
data.groupby("Fireplaces")["SalePrice"].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Fireplaces | ||||||||
| 0 | 571.0 | 141983.000000 | 44934.587316 | 34900.0 | 114502.0 | 135000.0 | 163750.0 | 342643.0 |
| 1 | 511.0 | 216776.514677 | 83977.283578 | 60000.0 | 158000.0 | 192000.0 | 262640.0 | 625000.0 |
| 2 | 77.0 | 250622.701299 | 133380.484825 | 110000.0 | 160000.0 | 210000.0 | 315000.0 | 755000.0 |
| 3 | 4.0 | 267500.000000 | 100041.657990 | 160000.0 | 193750.0 | 275000.0 | 348750.0 | 360000.0 |
data["Fireplaces"].corr(data["SalePrice"])
0.47674796573225514
sns.set(rc={'figure.figsize':(8,5)})
sns.boxplot(x=data["Fireplaces"], y=data["SalePrice"],palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='Fireplaces', ylabel='SalePrice'>
data["GarageArea"].corr(data["SalePrice"])
0.6303980450640168
sns.scatterplot(x=data["GarageArea"],y=data["SalePrice"], color="orange")
<AxesSubplot:xlabel='GarageArea', ylabel='SalePrice'>
print(data.groupby('GarageCond')["SalePrice"].describe())
plt.figure(figsize=(8, 10))
sns.boxplot(x='GarageCond', y='SalePrice', data=data,palette=['#FF9500', '#ffb30a'])
count mean std min 25% 50% \
GarageCond
0 67.0 106300.746269 33018.934527 34900.0 84500.0 104000.0
1 6.0 106416.666667 24079.901716 67000.0 98875.0 107000.0
2 31.0 112625.516129 31615.160605 40000.0 93000.0 114504.0
3 1052.0 190078.516160 83331.045428 35311.0 134500.0 170000.0
4 5.0 150080.000000 26745.223873 118500.0 132000.0 145000.0
5 2.0 124000.000000 4949.747468 120500.0 122250.0 124000.0
75% max
GarageCond
0 127500.0 200500.0
1 120000.0 137000.0
2 137250.0 177000.0
3 225000.0 755000.0
4 174900.0 180000.0
5 125750.0 127500.0
<AxesSubplot:xlabel='GarageCond', ylabel='SalePrice'>
data['Age'].describe()
count 1163.000000 mean 37.067928 std 31.993217 min 0.000000 25% 6.000000 50% 35.000000 75% 58.000000 max 136.000000 Name: Age, dtype: float64
sns.regplot(x=data['Age'],y=data['SalePrice'], color="#fb8b2ded")
<AxesSubplot:xlabel='Age', ylabel='SalePrice'>
data["TotalSF"].describe()
count 1163.000000 mean 2596.554600 std 821.976781 min 728.000000 25% 2038.000000 50% 2491.000000 75% 3034.000000 max 11752.000000 Name: TotalSF, dtype: float64
sns.scatterplot(x=data["TotalSF"],y=data["SalePrice"], color="orange")
<AxesSubplot:xlabel='TotalSF', ylabel='SalePrice'>
data.groupby('Electrical')['SalePrice'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Electrical | ||||||||
| FuseA | 76.0 | 121633.000000 | 37501.718480 | 34900.0 | 101750.0 | 123750.0 | 144125.0 | 239000.0 |
| FuseF | 19.0 | 110759.842105 | 28164.370151 | 58500.0 | 90875.0 | 122000.0 | 130750.0 | 145000.0 |
| FuseP | 2.0 | 105000.000000 | 45254.833996 | 73000.0 | 89000.0 | 105000.0 | 121000.0 | 137000.0 |
| Mix | 1.0 | 67000.000000 | NaN | 67000.0 | 67000.0 | 67000.0 | 67000.0 | 67000.0 |
| SBrkr | 1065.0 | 188345.065728 | 83905.356850 | 37900.0 | 132500.0 | 169000.0 | 224900.0 | 755000.0 |
sns.barplot(x="Electrical",y="SalePrice",data=data,palette=['#FF9500', '#ffb30a'])
<AxesSubplot:xlabel='Electrical', ylabel='SalePrice'>
.
The sales price follows a bell curve and is noramlly distributed with mean price of 182470.369733 and Standard Deviation of 83251.315928. So we can assume that majority of house price should be between 100000 to 250000.
Here the "Overall Quality of the house", "Living Area" and "Total Surface area" show very strong positive correlation with the price of the house
Most house in the dataset have overall 5 rating (average). As the rating increase the house price will increase significantly. Highest Average Price of houses have a 10 rating including the most costly house of the dataset and Houses with 9 rating have the smallest standard deviation.
Most number of houses are in zones with Low Density and Floating Village have the highest mean price.
House with no access to any alley have the highest average price followed by paved alley and gravel alley. The most expensive house in dataset have no alley access
The Price of a House is strongly correlated with the Number of Rooms and Bathrooms present.The most number of rooms in a house is 16, however most expensive house of the dataset(Price="755000.0") have 14 Total Rooms. Average Price of a House with 14 rooms is highest and House with 4 rooms is Lowest.
The most commonly used Foundations for a house are "Poured Contrete", "Cinder Block" and "Brick Tile". The most expensive type of Foundation for a house is "Poured Contrete" followed by "Wood" and "Stone". The most expensive house (Price="755000.0") is built with "Poured Contrete" and the cheapest house in the dataset (Price="34900.0") is made with "Cinder Block".
The area of basement and Price of the house shows strong positive correlation as shown in the Scatterplot that means we can conclude a larger area of the basement adds more value to the house and increases its sale price.
North Ridge region have the highest avegare sale price whereas the Meadow Village have the lowest average sale price. North Ames have the most number house on sale. North Ridge has the most expensive house(Price=755000.0) and Iowa DOT and Rail Road have the cheapest house(Price=34900.0) of the dataset.
There is a moderate positive correaltion between the number of Fireplaces and Price. Only 5 Houses have more than 3 Fireplaces and the Average Price of the houses are higher than the other houses. However The Most Expensive house of the dataset have only 2 Fireplaces so there must be other strong factors that affect the price.
The area of Garage is strongly correlated with Price and from the scatterplot we can see a positive slop. So, one can conclude that with increase in garage area the price of house will increase by some amount. The Condition of the Garage is also an Important factor here as well. Most of the Houses have 3 rating or Avegare Rating and have the highest avegare price.
Average age of the houses in the dataset is 37. The Regression Scatterplot show a Negetive Correlation between the age and Price which means the price of houses will decrease over time.
Total Surface Area of the House which includes all 1st, 2nd and Basement floor in Square feet is Strongly Correlated with the Price. Average size of the houses are 2596sq.ft and The house with biggest area indoor is 11752sq.ft big.
Most commonly used elelctric system is Standard Circuit Breakers & Romex , The most expensive house of the dataset also uses the electric system and the average value of the system is higher than the rest. Mix electric system is used in only one house.
https://pandas.pydata.org/docs/
https://matplotlib.org/stable/index.html
https://towardsdatascience.com/exploratory-data-analysis%028fc1cb20fd15
https://drive.google.com/drive/folders/1GCVMY7oZ53AMDs_tN3nzUviSpf1TvxlA?usp=sharing